Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unset request header writer GUID #36

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

fuzzypixelz
Copy link
Member

@fuzzypixelz fuzzypixelz commented Oct 24, 2024

The gtest_multithreaded__rmw_zenoh_cpp test of package test_rclcpp is quite flaky in the ZettaScaleLabs:dev/1.0.0 branch. On around 1/10 runs I get:

executor taking a service server request from service '/multi_consumer_clients' unexpectedly failed: 
duplicate sequence number in the map, 
at /tests/workspace/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:2457, 
at ./src/rcl/service.c:343

If I add a hacky log statement right before the call to add_to_query_map:

RMW_ZENOH_LOG_ERROR_NAMED("rmw_zenoh_cpp", "received query: gid=%zu, sn=%zu, thread_id=%zu", 
  (size_t*) request_header->request_id.writer_guid,
  request_header->request_id.sequence_number,
  std::this_thread::get_id());

I get queries with the same writer_guid and sequence_number:

38: [ERROR] [1729779404.278215962] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779405.279791879] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779406.280309755] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779407.286302714] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779408.287449964] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779409.291142881] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779410.294799507] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779411.299335466] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779412.300401299] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779413.302097967] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779414.308076925] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779415.314352509] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779416.319158093] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779417.320878469] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779418.322744052] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680
38: [ERROR] [1729779419.325122886] [rmw_zenoh_cpp]: received query: gid=281474046962464, sn=1, thread_id=281473285787680

Looking at the query attachment at the Zenoh level, I realized that the GUID is correctly propagated, but simply not set in rmw_take_request.

@fuzzypixelz fuzzypixelz added the bug Something isn't working label Oct 24, 2024
@fuzzypixelz fuzzypixelz self-assigned this Oct 24, 2024
@fuzzypixelz
Copy link
Member Author

This does not seem to be present on ros2/rmw_zenoh:rolling.

@fuzzypixelz
Copy link
Member Author

A good way of reducing the flakiness of the test is adding a delay at the start of the service callback.

@YuanYuYuan
Copy link
Collaborator

Thanks @fuzzypixelz. Can you clarify why adding a delay can pass the test? If this PR is necessary to add into rmw_take_request, I think it would 100% fail the test.

Copy link

@evshary evshary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides rmw_take_request, is it also needed in rmw_take_response?

rmw_zenoh_cpp/src/rmw_zenoh.cpp Outdated Show resolved Hide resolved
@fuzzypixelz
Copy link
Member Author

@YuanYuYuan The delay I referred to is just for making the test fail more consistently. In my tests, if the service is a bit slower then there is a greater chance for rmw_request to stumble upon duplicates.

This pull request doesn't assume any changes to the test.

@fuzzypixelz
Copy link
Member Author

Besides rmw_take_request, is it also needed in rmw_take_response?

Good catch!

@YuanYuYuan
Copy link
Collaborator

@YuanYuYuan The delay I referred to is just for making the test fail more consistently. In my tests, if the service is a bit slower then there is a greater chance for rmw_request to stumble upon duplicates.

This pull request doesn't assume any changes to the test.

I see. Then my question is why the original incorrect logic sometimes passes the test.

@fuzzypixelz
Copy link
Member Author

fuzzypixelz commented Oct 24, 2024

@YuanYuYuan The delay I referred to is just for making the test fail more consistently. In my tests, if the service is a bit slower then there is a greater chance for rmw_request to stumble upon duplicates.
This pull request doesn't assume any changes to the test.

I see. Then my question is why the original incorrect logic sometimes passes the test.

I suppose it's because the service/client were so fast that the duplicates never were present in the map at the same time (most of the time).

Copy link

@evshary evshary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@YuanYuYuan YuanYuYuan merged commit e6b7ecb into dev/1.0.0 Oct 24, 2024
7 checks passed
@YuanYuYuan
Copy link
Collaborator

I suppose it's because the service/client were so fast that the duplicates never were present in the map at the same time (most of the time).

Okay. Let's merge it. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants